Abstract

A wide range of Bayesian models have been proposed for data that is divided hierarchically into groups. These models aim to cluster the data at different levels of grouping, by assigning a mixture component to each datapoint, and a mixture distribution to each group. Multi-level clustering is facilitated by the sharing of these components and distributions by the groups. In this paper, we introduce the concept of Degree of Sharing (DoS) for the mixture components and distributions, with the aim of analyzing and classifying various existing models. Next we introduce a generalized hierarchical Bayesian model, of which the existing models can be shown to be special cases. Unlike most of these models, our model takes into account the sequential nature of the data, and various other temporal structures at different levels, while assigning mixture components and distributions. We show one specialization of this model aimed at hierarchical segmentation of news transcripts, and present a Gibbs Sampling based inference algorithm for it. We also show experimentally that the proposed model outperforms existing models for the same task.

1 Introduction 

In many applications we come across hierarchically grouped data. For example, in a text corpus, data is grouped into documents, paragraphs and sentences. Such data can be clustered at multiple levels, based on the notion of topics. A large number of hierarchical Bayesian models have been proposed for such data, many of which are quite similar to each other in various aspects. However, to the best of our knowledge, apart from empirical comparisons there has not been much research aimed at placing these models in perspective and making a comparative study of them. This is what we attempt in this paper. The main aspect of these models which we compare is how they share the mixture components and distributions across the groups at different levels.

The contributions of this paper are as follows: 1) We introduce a novel classification of hierarchical Bayesian models for grouped data, based on the Degree of Sharing of mixture components and distributions; 2) We introduce a generalized hierarchical Bayesian model and show many existing ones to be special cases of it; and 3) We show how it can be adapted for news transcript segmentation, for which we give an inference algorithm and demonstrate experimental results.

Figure 1: Data grouped at 3 levels: $D^1_i = i\ \forall i$, i.e. $D^1_1 = 1, D^1_2 = 2, D^1_3 = 3, D^1_4 = 4$ etc.; $D^2_i = 1$ for $i = 1, \dots, 4$, $D^2_i = 2$ for $i = 5, \dots, 8$. Also, $D^3(1) = 1$, $D^3(3) = 2$ etc.

2 Notations 

Consider $N$ datapoints $Y_1, Y_2, \dots, Y_N$ of any type (e.g. integers, real-valued vectors) based on the application. Each of these is associated with group membership variables (positive integers), which specify the grouping of the datapoints. If there are $L$ levels of grouping, each datapoint $Y_i$ is associated with observed variables $\{D^1_i, D^2_i, \dots, D^L_i\}$. For example, a text corpus consists of a set of documents, each of which consists of word-tokens. We can consider the word-tokens as datapoints $\{Y_i\}$, which are tagged with their document memberships using $\{D^2_i\}$, where $\{D^1_i\}$ are the token indices, to capture the sequential ordering. This is the standard setting used in most topic models for text documents. In addition, it is possible to consider a 3-level grouping with sentences within documents. Then each word-token $Y_i$ is associated with a sentence membership variable $D^2_i$ and a document membership variable $D^3_i$. In this paper, we will overload $D^l$ ($l > 1$) to indicate the higher-level group memberships of lower-level groups. For example, if $g$ is the index of a level-2 group, then $D^3(g)$ is the level-3 group that covers all the datapoints under group $g$, i.e. $D^3(g) = D^3_i$ where $D^2_i = g$. Please see Figure 1 for an illustration.

Most topic models consider documents or sentences to be bags of words, and do not consider the sequential nature of the data. This can be avoided with the current representation, as sequential relations between the word-tokens can be encoded using the indices $\{D^1_i\}$, which take integer values. Accordingly, for each datapoint $i$ we can define sequential neighbors $prev(i)$ and $next(i)$. Even sequential ordering of the higher-level groups like sentences and documents can be captured by the variables $D^2$ and $D^3$ respectively. In case sequential ordering is irrelevant at any level (for example, ordering of documents is usually not relevant unless there are timestamps), the group membership variables at that level act as simple identifiers.

Figure 2: Grouped data clustered at 2 levels ($l = 1, 2$). Colours indicate the clustering at each level, e.g. $Z^2(2) = Z^2(4)$. Different colours are used at the two levels.

The groups at the different levels may be clustered in some applications, like multi-level clustering. For this, we associate a group cluster variable with each group-index: $\{Z^1\}, \dots, \{Z^L\}$. Again, we can overload $Z^l$ ($l > 1$) to indicate the higher-level cluster memberships of lower-level groups. If $g$ is the index of a level-2 group, then $Z^3(g)$ is the level-3 cluster that covers all the datapoints under $g$, i.e. $Z^3(g) = Z^3_i$ where $D^2_i = g$. This causes hierarchical clustering of the datapoints, specified by the tuple $\{Z^1_i, \dots, Z^L_i\}$.

Bayesian modelling involves mixture components and mixture distributions. We will consider $K$ mixture components (topics) $\{\phi_1, \dots, \phi_K\}$, where $K$ may not be known. Also, we need mixture distributions for each level: $\{\theta^1\}, \dots, \{\theta^L\}$. These are discrete distributions over index variables, which are cluster indices of the lower layer. Note that the cluster indices at level 1 are indices of the mixture components. At each level, the distributions may be specific to the group clusters, defined by the group cluster variables $Z$. For example, if the groups at level $l$ are clustered, the groups in cluster $k$, i.e. $\{j : Z^l(j) = k\}$, will have access to only the distribution $\theta^{l-1}_k$ at level $l - 1$. The basic inference problem is to learn the cluster assignments $\{Z^l\}$, and estimate the mixture components $\{\phi\}$.


3 Review of Existing Models 

In this section we make a short review of several well-known 
models using the above notation. The models can be classi¬ 
fied based on the number of levels of grouping in the data that 
they consider. 


3.1 1-level models 

The simplest models are the 1-level mixture models, like GMM [Bishop and others, 2006]. Here $L = 1$, with $D^1_i = i$, and the datapoints are not grouped at all. There are $K$ mixture components $\{\phi\}$ which are Gaussian distributions, i.e. $\phi_k = \mathcal{N}(\mu_k, \Sigma_k)$. In general, the mixture components need not be Gaussian. The mixture distribution $\theta^1$ is a $K$-dimensional multinomial. Each datapoint is assigned to a mixture component $Z^1_i$, which defines a clustering of the datapoints. This assignment is IID as $Z^1_i \sim \theta^1$, and the sequential structure of the datapoints is not considered.
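As a minimal illustration of this 1-level setting, the sketch below draws data from a finite mixture with IID assignments $Z^1_i \sim \theta^1$. It assumes one-dimensional Gaussian components and uses only Python's standard library; the function name `sample_gmm` and all parameter values are illustrative assumptions, not taken from the paper.

```python
import random

def sample_gmm(n, theta, means, stds, seed=0):
    """Draw n points from a 1-D Gaussian mixture (illustrative sketch).

    theta: mixture distribution over the K components (non-negative weights).
    means, stds: parameters of the K Gaussian components phi_k.
    Returns (z, y): component assignments Z^1_i and datapoints Y_i.
    """
    rng = random.Random(seed)
    components = range(len(theta))
    z, y = [], []
    for _ in range(n):
        # IID assignment: no dependence on previously assigned datapoints.
        k = rng.choices(components, weights=theta)[0]
        z.append(k)
        y.append(rng.gauss(means[k], stds[k]))
    return z, y

z, y = sample_gmm(1000, theta=[0.2, 0.5, 0.3],
                  means=[-5.0, 0.0, 5.0], stds=[1.0, 1.0, 1.0])
```

Because every draw of `k` ignores the earlier assignments, permuting the datapoints leaves the model unchanged, which is exactly the exchangeability that the sequential models discussed later give up.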

In GMM, the number of mixture components $K$ is fixed and known. A non-parametric model with $L = 1$ is the Dirichlet Process Mixture Model (DP-MM), which considers infinitely many mixture components, though only a few of them are used for a finite number of datapoints. The mixture distribution $\theta^1$ is an infinite-dimensional multinomial, drawn from a stick-breaking distribution. The parameters of the mixture components are drawn from a base distribution $H$.
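The stick-breaking construction mentioned above can be sketched as follows. Since an infinite-dimensional vector cannot be stored, the sketch truncates the process after a fixed number of atoms and assigns the leftover mass to the last one; the truncation level, the function name `stick_breaking` and the value of `alpha` are assumptions made for illustration.

```python
import random

def stick_breaking(alpha, truncation, seed=0):
    """Truncated GEM(alpha) weights via stick-breaking (illustrative sketch)."""
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, alpha)   # fraction broken off the remaining stick
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)             # leftover mass goes to the last atom
    return weights

theta = stick_breaking(alpha=2.0, truncation=50)
```

Smaller values of `alpha` concentrate the mass on the first few atoms, which is why only a few components end up being used for a finite dataset.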

A one-level nonparametric model which does consider the sequential structure of the data is the HDP-HMM [Fox et al., 2008]. This model considers a set of $\theta^1$-distributions from which one may be chosen conditioned on the previous assignments of $Z^1$. The $Z^1$-assignment to each datapoint $i$ is done as $Z^1_i \sim \theta^1_j$ with $j = Z^1_{prev(i)}$, where $prev(i)$ is the predecessor of the current datapoint in the sequential order encoded by $\{D^1\}$, i.e. $prev(i) = i'$ where $D^1_i = D^1_{i'} + 1$.


3.2 2-level models 


Next, we move on to two-level models, i.e. where $L = 2$. This is the standard setting for document modelling, where the word-tokens are grouped into documents (one level of grouping). The document memberships of the datapoints are encoded by $D^2$. The most standard model of this kind is Latent Dirichlet Allocation (LDA) [Blei et al., 2003], which considers $K$ mixture components (topics) $\{\phi\}$, where $K$ is fixed and known. Each mixture component $\phi_k$ is a multinomial distribution over the vocabulary of size $V$. Here the level-2 groups (documents) are not clustered, i.e. $Z^2$ is distinct for each document. Consequently, $\theta^2$ is not used here, and the $\{\theta^1\}$ are group-specific. The $Z^1$-variables of the datapoints within any group $j$ are assigned as IID draws from $\theta^1_j$. Once again, no sequential structure is considered. Note that the mixture components $\phi$ are shared by all groups.

$\phi_k \sim Dir(\beta),\ k \in [1, K];\quad \theta^1_j \sim Dir(\alpha),\ j \in [1, M]$

$Z^1_i \sim \theta^1_j$ where $j = D^2_i;\quad Y_i \sim \phi_{Z^1_i}$ (1)
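The generative process of LDA described above can be sketched in a few lines. This is an illustrative sketch using symmetric Dirichlet priors and the standard library only; the helper `dirichlet`, the function `sample_lda` and the default hyperparameter values are assumptions made for the example.

```python
import random

def dirichlet(alpha, rng):
    """Sample from a Dirichlet distribution by normalising Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def sample_lda(doc_lengths, K, V, alpha=0.5, beta=0.5, seed=0):
    """Generate word-tokens for documents of the given lengths."""
    rng = random.Random(seed)
    # Mixture components (topics) are shared by all documents.
    phi = [dirichlet([beta] * V, rng) for _ in range(K)]
    topics, docs = [], []
    for n_j in doc_lengths:
        theta_j = dirichlet([alpha] * K, rng)   # group-specific distribution
        z = rng.choices(range(K), weights=theta_j, k=n_j)   # IID within the group
        y = [rng.choices(range(V), weights=phi[k])[0] for k in z]
        topics.append(z)
        docs.append(y)
    return phi, topics, docs

phi, topics, docs = sample_lda([5, 8], K=3, V=10)
```

Note where the two kinds of variables are created: `phi` outside the document loop (full sharing) and `theta_j` inside it (group-specific sharing).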


A non-parametric generalization of LDA is the Hierarchical Dirichlet Process (HDP) [Teh et al., 2006], which is also a 2-level extension of the DP-MM discussed above. Here, the number of components is not fixed or known, so the document-specific $\{\theta^1\}$-distributions are infinite-dimensional, and drawn from a Dirichlet Process/Stick-Breaking Process instead of a finite-dimensional Dirichlet.

Another nonparametric 2-level model is the Nested Dirichlet Process (NDP), where the level-2 groups (documents) are clustered using $Z^2$, which are drawn according to a discrete distribution $\theta^2$. Each cluster induced by $Z^2$ uses its own $\theta^1$. However, unlike the previous models, here the mixture components themselves are specific to the clusters induced by $Z^2$.

$\phi_k \sim H\ \forall k;\quad \theta^1_z \sim GEM(\eta_1)\ \forall z;\quad \theta^2 \sim GEM(\eta_2)$

$Z^2(j) \sim \theta^2,\ j \in [1, M];\quad Z^1_i \sim \theta^1_{Z^2(D^2_i)};\quad Y_i \sim \phi_{Z^2(D^2_i), Z^1_i}$ (2)

Figure 3: Above: HDP and NDP. Below: MLC-HDP and STM. The locations of the mixture components and distributions in the plate diagrams indicate the type of sharing (full/group-specific/cluster-specific).


3.3 3-level models 

Next, we look into some 3-level models. MLC-HDP [Wulsin et al., 20] is an attempted compromise between HDP and NDP, where the groups are clustered (unlike HDP) but mixture components are not cluster-specific (unlike NDP), and moreover the data is grouped into 3 levels by observed group variables $\{(D^2_i, D^3_i)\}$. These groups can be clustered by random variables $Z^2, Z^3$, which are drawn from discrete distributions $\theta^2, \theta^3$ respectively.

A three-level model that considers the sequential nature of the data is the Topic Segmentation Model (TSM) [Du et al., 2013]. Here within each document the sentences are clustered using $\{Z^2\}$, but analogous to HDP-HMM the $\theta^2$-distributions are specific to values of $Z^2$. In particular, for any sentence $s$, $\theta^2_s$ is a distribution over two values: $Z^2(prev(s))$ and $Z^2(prev(s)) + 1$ (to induce linear clustering/segmentation). The $\{\theta^1\}$ are specific to the sentence-clusters. The documents themselves are not clustered, so $\theta^3, Z^3$ are not used.
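The linear clustering constraint, where each sentence either stays in the segment of its predecessor or opens the next segment, can be sketched as below. The fixed switching probability `p_new` is a simplifying assumption of this sketch (in TSM the corresponding distribution is part of the model), and the function name is hypothetical.

```python
import random

def linear_segmentation(num_sentences, p_new, seed=0):
    """Assign segment labels to sentences in order (num_sentences >= 1).

    Each sentence takes either the label of its predecessor or that label + 1,
    so the labels form contiguous, increasing segments.
    """
    rng = random.Random(seed)
    z2 = [0]
    for _ in range(1, num_sentences):
        step = 1 if rng.random() < p_new else 0
        z2.append(z2[-1] + step)
    return z2

z2 = linear_segmentation(num_sentences=30, p_new=0.2)
```

By construction a segment label can never reappear once the sequence has moved past it, which is what distinguishes segmentation from unconstrained clustering of sentences.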

A somewhat unusual case is the Subtle Topic Model (STM) [Das et al., 2013], which considers multiple document-specific distributions over the mixture components, and distributions specific to sentences over this set of distributions. Here neither the documents nor the sentences are clustered. Effectively, only $\{\theta^1\}$-distributions are present, which are shared across sentences in the same document, but not across documents. However, the process of assigning $Z^1$-variables requires other sentence-specific variables in addition to $\{\theta^1\}$.


4 DoS-classification of models 

In the above discussion, we have focussed on 3 major aspects: 1) the number of layers of grouping, 2) the way in which the mixture components and mixture distributions are shared, and 3) whether sequential structure is considered or not at different layers. Based on these aspects, we propose a nomenclature for the models.


4.1 DoS Concept 

As already discussed, in all the hierarchical Bayesian models, the mixture components $\{\phi\}$ and the mixture distributions $\{\theta^l\}$ are shared among the different groups. We have seen three types of sharing:

1. Full sharing (F): where components/distributions are shared by all the groups. For example, in HDP, MLC-HDP etc. the mixture components are shared by all the level-2 groups.

2. Group-specific sharing (G): where components/distributions are specific to groups, and not accessible outside the groups. For example, in HDP, STM etc. the distributions $\theta^1$ are specific to the top-level groups (documents).

3. Cluster-specific sharing (C): where the components/distributions are specific to clusters of groups, but not accessible outside the clusters. For example, in MLC-HDP each $\theta^1$-distribution is accessible to only one cluster of level-2 groups, and each $\theta^2$-distribution is accessible to only one cluster of level-3 groups. In all the models, the mixture components are specific to clusters of level-1 groups (as datapoints are clustered by the assignment of a mixture component through $Z^1$).

Based on these notions we introduce Degree-of-Sharing (DoS). For any given model, we first specify how the mixture components are shared at each of the levels: full (F), group-specific (G) or cluster-specific (C), and we call this the DoS of $\{\phi\}$. The types of sharing at the different levels are hyphen-separated. Next, regarding the distributions $\{\theta^l\}$ at each level $l$, we specify how they are shared by the levels $(l + 1)$ upwards, and we call this the DoS of $\{\theta^l\}$. Also, to indicate if sequential structure is considered at the different levels, we add S to the levels where it is considered. Finally, to indicate how groups are clustered at different levels, we add N to the levels where there is no clustering of groups, P to the levels where the number of clusters is fixed, and NP to the levels where the clustering is non-parametric. Note that this indicates the dimensionality of $\{\theta^l\}$: P indicates that it is finite-dimensional, NP indicates it is infinite-dimensional, and N indicates it is not in use.

By combining the DoS of $\{\phi\}, \theta^1, \dots, \theta^L$ in that order, we have the DoS-classification of the model. The DoS of the different variables are semicolon-separated. The number of such variables in any of these models is $(L + 1)$, so that the DoS-classification of any model will have $(L + 1)$ semicolon-separated parts. Also, the first part (corresponding to $\phi$) will consist of $L$ hyphen-separated letters, and the number of these letters will keep decreasing by one for each of the following parts (corresponding to $\theta^1$, $\theta^2$ etc.), but followed by the letters specifying dimensionality and sequence structure.
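The construction rule above is mechanical enough to be written down as a small formatter. The sketch below is only an illustration of the notation; the function names and the input encoding (a list of sharing letters plus a dimensionality tag and a sequential flag per distribution) are assumptions of this example.

```python
def dos_part(sharing, dim=None, sequential=False):
    """Format one semicolon-separated part of a DoS-classification.

    sharing: list of 'F'/'G'/'C' letters, one per level (may be empty).
    dim: 'N', 'P' or 'NP' for a mixture distribution, None for the components.
    sequential: append 'S' if sequential structure is considered.
    """
    tokens = list(sharing)
    if dim is not None:
        tokens.append(dim)
    if sequential:
        tokens.append('S')
    return '-'.join(tokens)

def dos_classification(components, distributions):
    """components: sharing letters of the mixture components, one per level.
    distributions: one (sharing, dim, sequential) tuple per level, lowest first."""
    L = len(components)
    if len(distributions) != L:
        raise ValueError("need one mixture distribution per level")
    parts = [dos_part(components)]
    parts += [dos_part(s, d, q) for s, d, q in distributions]
    return '; '.join(parts)   # always L + 1 parts

# A 2-level model: components cluster-specific at level 1 and fully shared at
# level 2; level-1 distributions group-specific and finite-dimensional;
# level-2 groups not clustered.
two_level = dos_classification(['C', 'F'], [(['G'], 'P', False), ([], 'N', False)])
```

Here `two_level` evaluates to `'C-F; G-P; N'`, i.e. three semicolon-separated parts for a model with $L = 2$.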

4.2 Classification of Models 

Let us illustrate the concept of DoS-classification with a case study of all the models discussed in Section 3.

Level-1 parametric models like GMM have mixture components $\{\phi\}$ specific to clusters of datapoints, so that its DoS is C. But the mixture distribution $\theta^1$ is fully shared by all the datapoints, so that its DoS is F. The number of clusters formed at level 1 (i.e. the dimensionality of $\theta^1$) is fixed (i.e. P) and sequential structure is not considered. Hence the DoS-classification of GMM is C; F-P. In case of DP-MM, $\theta^1$ is infinite-dimensional (NP), i.e. the DoS-classification is C; F-NP.

In case of HDP-HMM, the mixture components $\{\phi\}$ are again specific to clusters of datapoints, so that its DoS is C. Here, the $\{\theta^1\}$ are non-parametric (NP), and the sequential structure is also considered. So, the DoS-classification of HDP-HMM is C; F-NP-S. Note that here $\{\theta^1\}$ is a collection of distributions, from which one is chosen for each datapoint $i$, depending on the assignment to $prev(i)$.

Level-2 models: In HDP or LDA, the mixture components are cluster-specific at level 1 and fully shared at level 2, so the DoS for $\phi$ is C-F. The $\{\theta^1\}$ are specific to level-2 groups, so the DoS for $\theta^1$ is G. Sequential structure is not considered at any level. In case of LDA the number of clusters of datapoints (level 1) is fixed (P), and for HDP it is NP. The level-2 groups are not clustered (N) in either model. So we say the DoS-classification of LDA is C-F; G-P; N, and for HDP it is C-F; G-NP; N. In case of NDP, the $\{\phi\}$ are cluster-specific at both levels, so its DoS is C-C. $\theta^1$ is specific to clusters of level-2 groups (C), and it is non-parametric (NP). The $\theta^2$ are also non-parametric (NP). So, the DoS-classification is C-C; C-NP; NP.

Level-3 models: In MLC-HDP, the mixture components $\{\phi\}$ are cluster-specific at level 1, but fully shared at both levels 2 and 3, so that its DoS is C-F-F. The $\{\theta^1\}$ are specific to clusters of level-2 groups but fully shared by level-3 groups, and they are nonparametric, so the notation is C-F-NP. The $\{\theta^2\}$ are specific to clusters of level-3 groups and nonparametric, and finally $\theta^3$ is nonparametric. So the DoS-classification of MLC-HDP is C-F-F; C-F-NP; C-NP; NP.

For the Topic Segmentation Model, the topics $\{\phi\}$ are shared by all sentences and documents, i.e. its DoS is C-F-F. The $\{\theta^1\}$ are specific to clusters of sentences inside individual documents, i.e. the DoS is C-G, and they are of fixed dimension (P). The $\{\theta^2\}$ used to cluster sentences are document-specific (G). The number of clusters (segments) of sentences to be formed is not fixed, and sequential structure is also taken into account, so the notation is G-NP-S. Finally, the documents themselves are not clustered (N), and the DoS-classification of TSM is C-F-F; C-G-P; G-NP-S; N.

Finally we come to the Subtle Topic Model (STM), where the topics are shared by all sentences and documents, i.e. the DoS is C-F-F. The $\theta^1$ are shared by all sentences in a document but are specific to documents, and they are nonparametric, i.e. the notation is F-G-NP. The sentences and documents are not clustered, so the DoS-classification of STM is C-F-F; F-G-NP; N; N.

5 Generalized Bayesian Model for Grouped 
Sequential Data 

Having discussed the DoS-classification of various existing models, it is clear that despite over a decade of research on topic models, there are several DoS-classifications for which there are no existing models. But instead of trying to point out those classifications individually and propose models following them, we now propose a generalized Bayesian model for grouped sequential data. We will show that by specific settings of this model, it is possible to recover all the previously discussed models (or their close variants). Other models, not explored so far, can also be obtained from it.

5.1 GBM-GSD 

We consider sequential data with $L$ levels of grouping, where the groups are sequential at every level (e.g. in document modelling, we consider that the sentences within each document, and the documents themselves, are sequentially arranged). We consider that clustering happens at all levels, i.e. $\{\theta^1\}, \dots, \{\theta^L\}$ all exist. To capture the sequential nature, we assume that at every level (say $l$), there is a collection of distributions $\{\theta^l\}$ from which one can be chosen for each group, conditioned on the previous assignments (as considered in sHDP-HMM and TSM). We also consider that all the distributions are infinite-dimensional (i.e. NP), i.e. neither the number of mixture components nor the number of clusters formed at each level is known in advance. We also consider that all the mixture components are accessible to all the level-2 groups, but introduce a binary random vector $B_i$ specific to each datapoint. This vector indicates which mixture components are accessible to each datapoint. We will show that using this vector, we can make the mixture components group-specific or cluster-specific, and also capture other more intricate structures that would not be possible without it.

The generative process hierarchically clusters the groups from the top level to the bottom level. At every intermediate level $l$, it assigns $Z^l(g)$ to each group $g$ at level $l$. But for that it will have access to only those $\theta^l$-distributions that are specific to the cluster of group $g$ as a result of the clustering at level $(l + 1)$. If group $g$ is part of group $m = D^{l+1}(g)$ at level $(l + 1)$, then the $\theta^l$-distributions corresponding to $Z^{l+1}(m)$ must be used. Finally, at level 1, each datapoint $i$ is assigned a binary vector $B_i$ conditioned on the $B$-vectors corresponding to all previous datapoints. The distribution $\theta^1$ is multiplied elementwise with this vector $B_i$ (written $B_i \circ \theta^1$), so that only a subset of the components is available to datapoint $i$.

Algorithm 1 Generalized Bayesian Model for Grouped Sequential Data (GBM-GSD)

$\phi_k \sim H,\ \forall k$
for $g = 1 : G_L$ do
    $Z^L(g) \sim \theta^L \mid Z^L(1), \dots, Z^L(prev(g))$
end for
for $l = L - 1 : 2$ do
    for $g = 1 : G_l$ do
        $Z^l(g) \sim \theta^l_j \mid Z^l(1), \dots, Z^l(prev(g))$ where $j = Z^{l+1}(D^{l+1}(g))$
    end for
end for
for $i = 1 : N$ do
    $B_i = f(B_1, \dots, B_{prev(i)})$
    $Z^1_i \sim B_i \circ \theta^1_j \mid Z^1_1, \dots, Z^1_{prev(i)}$ where $j = Z^2(D^2_i)$
    $Y_i \sim \phi_k$ where $k = Z^1_i$
end for
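The role of the binary vector $B_i$ at level 1 can be illustrated with a short sketch: the mixture distribution is multiplied elementwise by the mask, and the component is drawn from what remains. The implicit renormalisation (performed here by the weighted sampler) and the function name `masked_draw` are assumptions of this sketch, as are the example values.

```python
import random

def masked_draw(theta, b, rng):
    """Sample a component index from theta restricted to entries where b is 1."""
    weights = [t * m for t, m in zip(theta, b)]
    if sum(weights) == 0:
        raise ValueError("no accessible component with positive probability")
    # random.choices normalises the weights internally.
    return rng.choices(range(len(theta)), weights=weights)[0]

rng = random.Random(0)
theta = [0.5, 0.3, 0.2]
b = [1, 0, 1]   # component 1 is not accessible to this datapoint
draws = [masked_draw(theta, b, rng) for _ in range(200)]
```

Setting `b` to all ones recovers an ordinary draw from `theta`; every other choice of the mask carves out a datapoint-specific subset of components.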








5.2 Recovery of Existing Models 

The level-1 models can be recovered easily. By setting $B_i$ as a vector of all 1s for all the datapoints, and by making $\theta^1$ conditioned only on $Z^1_{prev(i)}$ and GEM-distributed, we get back HDP-HMM. In case $\theta^1$ is also independent of the previous assignments, we have DP-MM, and if it is finite-dimensional it will be GMM, provided the base distribution $H$ is Gaussian.

When $L = 2$, to recover HDP we need to define $\theta^2$ such that $Z^2(g) = g$ for all the groups $g$, so that groups are not clustered. Then we again set $B_i$ to be the vector of all 1s, and make $\{\theta^1\}$ independent of all previous assignments of $Z^1$. The $\{\theta^1\}$ should be drawn from a GEM. If the $\{\theta^1\}$ are finite-dimensional and drawn IID from a Dirichlet, and if all the $\{\phi_k\}$ are also drawn from a Dirichlet, then we have LDA.

NDP involves nonparametric clustering of level-2 groups without sequential ordering, so generation of $Z^2(g)$ should be independent of previous assignments, and $\theta^2$ should be drawn from a GEM. NDP also has the special characteristic that the different level-2 clusters do not share the same mixture components. This can be managed by setting $B_i$ through an appropriate function $f$, which will return a vector with 0 for those mixture components that have been assigned in other level-2 clusters, i.e. $B_{ik} = 0$ if $\exists j$ such that $Z^2_j \neq Z^2_i$ and $Z^1_j = k$.
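The masking function used to recover NDP can be sketched directly from its definition. The sketch assumes that assignments are stored in plain lists indexed by datapoint, with `None` marking datapoints that have not been assigned yet; the function name `ndp_mask` and this data layout are illustrative assumptions.

```python
def ndp_mask(i, z1, z2, num_components):
    """Return B_i for datapoint i given the assignments of the other datapoints.

    z1[j]: component assigned to datapoint j (None if not yet assigned).
    z2[j]: level-2 cluster of the group containing datapoint j.
    """
    b = [1] * num_components
    for j, (k, c) in enumerate(zip(z1, z2)):
        if j != i and k is not None and c != z2[i]:
            b[k] = 0   # component k is already in use by another level-2 cluster
    return b

z1 = [0, 1, None, 2]
z2 = [0, 1, 0, 1]
b_2 = ndp_mask(2, z1, z2, num_components=4)
```

In this example datapoint 2 lies in level-2 cluster 0, so the components 1 and 2 used in cluster 1 are blocked, while component 0 (used in its own cluster) and the unused component 3 remain available.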

When $L = 3$, MLC-HDP can be recovered by removing the conditioning on previous assignments in the assignment of $Z^2$ and $Z^3$, and by setting $B_i$ to be the vector of all 1s. The $\{\theta^l\}$ should be drawn from GEMs. For TSM, $\theta^3$ should ensure that documents are not clustered, $B_i$ should be the vector of all 1s, and assignment of $Z^1$ should be independent of all previous assignments. Regarding $Z^2$, $\theta^2$ should ensure that for any sentence (level-2 group) $g$, $Z^2(g)$ is either $Z^2(prev(g))$ or $Z^2(prev(g)) + 1$.


6 News Transcript Segmentation 

We want to extend the generative framework for grouped sequential data (Algorithm 1) for modeling news transcripts. This data is hierarchical since there are broad news categories like politics, sports etc., under which there are individual stories or topics. In the Bayesian approach, we consider mixture components $\{\phi\}$ that correspond to these stories, and the broad categories are represented with distributions $\{\theta^1\}$ over these stories. As usual, each $\theta^1$-distribution is specific to a level-2 cluster (segment), and such clustering is induced by $\{\theta^2\}$, specific to the level-3 groups (the transcripts). The transcripts are not clustered. The observed datapoints $Y_i$ are word-tokens, each represented as an integer (index of the word in the vocabulary). We define $prev(i) = i - 1$ if $i$ and $i - 1$ are in the same sentence, otherwise $prev(i) = -1$. Similarly, $next(i)$ is defined within sentences. Also, $prev$ and $next$ are defined for sentences. $Z^1_i$ indicates the news story (level 1) and $Z^2_i$ indicates the news category (level 2) that token $i$ is associated with. Each sentence is a level-2 group.
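The within-sentence neighbour functions can be computed in a single pass over the tokens. The sketch below uses 0-based token indices and returns -1 where no neighbour exists; the use of -1 for a missing successor, the function name `build_prev_next` and the input layout (one sentence index per token) are assumptions made for illustration.

```python
def build_prev_next(sentence_of):
    """sentence_of[i] is the sentence index of token i, tokens in reading order.

    Returns (prev, nxt), where prev[i] / nxt[i] is the neighbouring token in the
    same sentence, or -1 if token i starts / ends its sentence.
    """
    n = len(sentence_of)
    prev = [-1] * n
    nxt = [-1] * n
    for i in range(1, n):
        if sentence_of[i] == sentence_of[i - 1]:
            prev[i] = i - 1
            nxt[i - 1] = i
    return prev, nxt

prev, nxt = build_prev_next([0, 0, 0, 1, 1])
```

For the five tokens above, the first token of each sentence gets `prev` equal to -1 and the last token of each sentence gets `nxt` equal to -1.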


6.1 LaDP model for news transcripts 

News transcripts were first modelled by the Layered Dirichlet Process (LaDP) [Mitra et al., 2013]. Several versions of this model were proposed, with different combinations of exchangeability properties at the different layers (which included MLC-HDP). Here, the level-2 groups were not sentences, but the word-tokens themselves. In the most successful models, sequential structure was considered at both level 1 and level 2, i.e. the assignments of $Z^1_i$ and $Z^2_i$ are conditioned on $Z^1_{prev(i)}$ and $Z^2_{prev(i)}$ respectively. The clustering at both level 2 and level 1 is nonparametric. The DoS-classification of LaDP is C-F-F; C-F-NP-S; F-NP-S; N.


6.2 Modeling Temporal Structure 

News transcripts have characteristic temporal features regarding assignments of $Z^1$ and $Z^2$, for which the GBM-GSD needs to be modified appropriately. These features are discussed below. LaDP is insufficient for news transcripts, because it does not capture all of them.

The number of level-2 clusters (segments) is fixed and known. News transcripts from a particular source can be expected to have $K$ news categories in a fixed order (say politics, national affairs, international affairs, business and sports, in that order). Segmentation is the task of linear clustering of words/sentences, i.e. each word/sentence $s$ can be assigned either to $Z^2(prev(s))$ or to $Z^2(prev(s)) + 1$. In LaDP, each datapoint $i$ is assigned a value of $Z^1_i$ and $Z^2_i$ based on the assignments of $prev(i)$, and segmentation happens based on these assignments. But this does not guarantee the formation of $K$ segments. To overcome this issue, let it be known to the model that the observed data sequence has $K$ level-2 segments. Then the sequence can be partitioned into $K$ parts of sizes $N_1, \dots, N_K$. These sizes may be modeled by a Dirichlet distribution where the parameters $\gamma_k$ signify the relative lengths/importance of the news categories.

$\left(\frac{N_1}{N}, \dots, \frac{N_K}{N}\right) \sim Dir(\gamma_1, \dots, \gamma_K);\quad Z^2_j = s$ where $\sum_{k=1}^{s-1} N_k < j \le \sum_{k=1}^{s} N_k$ (3)

In the GBM-GSD, $\theta^2$ needs to be defined as a deterministic function, conditioned on $\{N_1, \dots, N_K\}$.
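A minimal sketch of this fixed-$K$ segmentation prior is given below: segment proportions are drawn from a Dirichlet, converted to integer sizes, and each position is then labelled deterministically by the segment it falls into. Converting proportions to sizes by rounding the cumulative proportions is an assumption of this sketch (it can produce empty segments), as are the function name and the 0-based labels.

```python
import random

def sample_segment_labels(n, gammas, seed=0):
    """Partition positions 0..n-1 into K = len(gammas) contiguous segments."""
    rng = random.Random(seed)
    # Dirichlet draw via normalised Gamma variables.
    g = [rng.gammavariate(a, 1.0) for a in gammas]
    total = sum(g)
    labels, cum, start = [], 0.0, 0
    for s, x in enumerate(g):
        cum += x / total
        # The last segment always ends at n so that the sizes sum to n.
        end = n if s == len(g) - 1 else round(cum * n)
        labels.extend([s] * (end - start))   # deterministic given the sizes
        start = end
    return labels

labels = sample_segment_labels(100, gammas=[2.0, 2.0, 3.0, 1.0, 1.0])
```

Larger `gammas` entries give the corresponding category a longer expected segment, matching the interpretation of $\gamma_k$ as relative length or importance.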

Topic coherence has been considered in various text segmentation papers like [Eisenstein and Barzilay, 200] and [Mitra et al., 2013]. This is the property that within the same level-2 segment, successive datapoints are likely to be assigned to the same topic (mixture component). This can be easily modelled by the Markovian approach, i.e.

$Z^1_i \sim \rho\,\delta_{Z^1_{prev(i)}} + (1 - \rho)(B_i \circ \theta^1_s)$ where $s = Z^2_i$ (4)

This means that the $i$-th datapoint can be assigned the $Z^1$-value of its predecessor $prev(i)$ with probability $\rho$, or any value with probability $(1 - \rho)$. The other available values are dictated by $B_i$, as discussed next. This is similar to the BE mixture model [Mitra et al., 2013].
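The Markovian assignment above is a two-part mixture, and its probability vector can be written out explicitly. In the sketch below the masked term is renormalised before mixing so that the result sums to one; that renormalisation, the function name and the example numbers are assumptions of this illustration, and at least one accessible component with positive weight is assumed.

```python
def sticky_distribution(prev_k, theta_s, b_i, rho):
    """Probabilities of Z^1_i under rho * delta(prev_k) + (1 - rho) * masked theta_s."""
    masked = [t * m for t, m in zip(theta_s, b_i)]
    total = sum(masked)                       # assumed to be positive
    probs = [(1.0 - rho) * w / total for w in masked]
    probs[prev_k] += rho                      # extra mass on the predecessor's topic
    return probs

probs = sticky_distribution(prev_k=0, theta_s=[0.5, 0.3, 0.2], b_i=[0, 1, 1], rho=0.6)
```

In this example topic 0 is masked out of the fresh draw but still receives probability 0.6 through the sticky term, while the remaining 0.4 is split between topics 1 and 2 in the ratio 3:2.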

Level-2 segments do not share mixture components, because each individual news story (topic) can come under only one news category. Also, topics do not repeat inside a level-2 segment. Inside a level-2 segment $s$, successive datapoints are expected to be assigned to the same mixture component due to temporal coherence. However, in a news transcript, a news story will be told only once, which means that a particular component may be present only in a single chunk, and cannot reappear in non-contiguous parts of the segment. For this purpose the generative process needs to be manipulated through $B_i$. Initially we set $B_i$ to be all 1s, and whenever a component $\phi_k$ is sampled for any point, we set $B_{ik} = 0$ for all following points in the segment, so that $\phi_k$ cannot be sampled again. The generative process is as follows:


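Algorithm 2 Generative Model for News Transcripts

$H_c \sim Dir(\beta)\ \forall c$
$c(k) \sim U(K),\ \phi_k \sim H_{c(k)}\ \forall k$
$\theta^1_s \sim GEM(\alpha)$ where $s \in [1, K]$
for $g = 1$ to $G_3$ do
    $B_{gk} = 1\ \forall k$
    draw the segment sizes $(N_{g1}, \dots, N_{gK})$ as in Equation (3)
end for
for $j = 1$ to $G_2$ do
    $Z^2_j = s$ based on $(N_{g1}, \dots, N_{gK})$ where $g = D^3(j)$
end for
for $i = 1 : N$ do
    if $Z^2_{prev(i)} \neq Z^2_i$ set $\rho = 0$
    $Z^1_i \sim \rho\,\delta_{Z^1_{prev(i)}} + (1 - \rho)(B_g \circ \theta^1_s)$ where $s = Z^2_i$, $g = D^3_i$
    if $Z^1_i \neq Z^1_{prev(i)}$ set $B_{gk} = 0$ where $k = Z^1_i$, $g = D^3_i$
    $Y_i \sim mult(\phi_k)$ where $k = Z^1_i$
end for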
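The interplay of temporal coherence and the no-repeat constraint inside one level-2 segment can be simulated with a short sketch. It assumes a finite (truncated) $\theta^1_s$ given as a list, a fixed coherence probability `rho`, and that once all components are exhausted the current story simply continues; these choices and the function name are illustrative assumptions rather than part of the model specification.

```python
import random

def sample_segment_topics(n, theta_s, rho, seed=0):
    """Sample topics for n consecutive datapoints of one level-2 segment."""
    rng = random.Random(seed)
    b = [1] * len(theta_s)     # all components are initially available
    z = []
    for i in range(n):
        stay = i > 0 and rng.random() < rho
        weights = [t * m for t, m in zip(theta_s, b)]
        if stay or sum(weights) == 0:
            k = z[-1]          # continue the current story
        else:
            k = rng.choices(range(len(theta_s)), weights=weights)[0]
            b[k] = 0           # k can never be drawn afresh again
        z.append(k)
    return z

z = sample_segment_topics(200, theta_s=[0.25, 0.25, 0.25, 0.25], rho=0.9)
```

Because a component is masked as soon as it is first drawn, it can only persist through the coherence step, so every topic occupies a single contiguous chunk of the segment.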


Here $G_3$ is the number of transcripts, and $G_2$ the number of sentences across all the transcripts. Clearly this model has 3 levels, and sequential structure is considered at level 2 (sentences) and at level 1 (word-tokens). Any topic $k$ belongs to a broad category $c(k)$ ($\in \{1, \dots, K\}$, uniformly at random), and corresponding to each category we have a base distribution $H_c$, which in turn are all drawn from a common base distribution $Dir(\beta)$. This helps to capture the fact that mixture components are specific to level-2 segments. The documents are not clustered, the sentences are clustered (segmented) with a fixed number of segments, and the number of topics (word-clusters) is not fixed. The topics are shared across all transcripts, but are specific to clusters of sentences; the $\theta^1$-distributions are specific to level-2 segments (clusters of sentences) but shared across transcripts; the $\theta^2$-distributions are transcript-specific (parametrized by $\{N_g\}$); and $\theta^3$ are not used. So the DoS-classification for the generative model of news transcripts is C-C-F; C-F-NP-S; G-P-S; N.

6.3 Inference Algorithm 

We now discuss inference for this model. We need an infer¬ 
ence algorithm which ensures that K segments are formed. 
We start with the joint distribution. 

$p(Y, Z, B, N, \Phi, \{\theta\}, \{H\}) \propto p(\{N_g\}) \times \dots$

We can collapse the variables $\{H_c\}$ and $\{\theta^1\}$, and perform Gibbs Sampling. The key feature of this likelihood function is the presence of the $\{N_g\}$ variables. To handle these, we introduce auxiliary variables $I_{g1}, \dots, I_{g,K-1}$, which are the level-2 changepoints, i.e. the set of datapoints $\{i\}$ at which $Z^2_i \neq Z^2_{prev(i)}$. $\{I\}$ and $\{N\}$ are deterministically related. We introduce the $I$ variables to simplify the sampling. We initialize the $Z^2$ variables by sampling a level-2 segmentation of the datapoints into $K$ segments. The $B$ and $Z^1$ variables are sampled accordingly. In each iteration of Gibbs Sampling, we consider the state-space of $I_{gs}$ as $I_{gs} \in \{I_{g,s-1}, \dots, I_{g,s+1}\}$, i.e. the level-2 potential changepoints lying in between $I_{g,s-1}$ and $I_{g,s+1}$. The process is described in Algorithm 3. Here, $B_{gs} = \{B_i : i \in set\}$ where $set = \{i : D^3_i = g, Z^2_i = s\}$, i.e. the set of datapoints in transcript $g$ in segment $s$ (similarly $Z^1_{gs}$, $Z^2_{gs}$, $Y_{gs}$). The major part of the Gibbs sampling is to sample the values $(B_s, Z^1_s)$ for any segment $s$, conditioned on the remaining $B$ and $Z^1$ variables. This can be done using the Chinese Restaurant Process (CRP), where any component $k$ may be sampled for $Z^1_i$ (where datapoint $i$ is within segment $s$) with probability proportional to the number of times it has been sampled, provided $B_{prev(i),k} = 1$. The procedure is detailed in Algorithm 3, which is called Global Inference as it considers the overall structure of the transcript (as described in Section 6.2).


Algorithm 3 Global Inference Algorithm by Blocked Gibbs Sampling (GI-BGS)

for transcript $g = 1$ to $G_3$ do
    Initialize $I_g$ with $(K - 1)$ points by sampling from $Dir(\gamma)$;
    Set $Z^2$ according to $I$;
    Initialize $\{B\}, \{Z^1\}$ variables;
end for
Estimate components $\phi \leftarrow E(\phi \mid Z, B, S, Y)$;
while not converged do
    for transcript $g = 1$ to $G_3$ do
        for segment $s = 1 : K$ do
            sample $I_{gs} \in \{succ(I_{g,s-1}), \dots, pred(I_{g,s+1})\}$ $\propto \dots$
            update $Z^2$ according to $I$
            sample $(B_{gs}, Z^1_{gs})$ $\propto \dots$
        end for
    end for
    Update components $\phi \leftarrow E(\phi \mid B, Z^1, Y)$;
end while



7 Experiment on News Transcript 
Segmentation 

For news transcript segmentation, we used the news transcripts used by [Mitra et al., 2013] for hierarchical segmentation. Here, each transcript has 4-5 news categories (politics, national affairs, international affairs, business, sports) in that fixed order. Overall, each transcript is about 5000 tokens long (after removal of stop words and infrequent words), and has about 40 news stories, spread over the 5 categories. The task is to segment the transcript at two levels. At level 1, each segment should correspond to a single story, while at level 2, each segment should correspond to a news category. The endpoints of the sentences are assumed to be known (these can be figured out based on pause durations in speech-to-text conversion), and are used to define level-2 groups. There are about 300-350 sentences per transcript.

In this dataset, the datapoints per sequence are too few
in number to learn the level-1 mixture components (topics).
Moreover, as already explained, each story occurs only once
in a transcript, thus reducing learnability. Hence, we
considered 60 randomly chosen transcripts, and using initial
segmentations of each sequence by the level-1 changepoints, 136
topics were learnt using HDP. These topics form our initial
estimate of $\phi$, using which we performed inference on
individual sequences. The inference provides us with the $Z^1$
and $Z^2$ variables, based on which we can infer the
segmentation at the two levels. We have gold standard segmentation
available at both layers, and so we compute the segmentation
errors (S1, S2) at both layers. S1 and S2 are computed by
taking the average $P_k$-measure [Mitra et al., 2013] for three
different values of $k$, namely the maximum, minimum and
average lengths of gold-standard segments (level-1 segments
for S1 and level-2 segments for S2).
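As an illustration of the averaging over three window sizes, here is a minimal Python sketch built on the commonly used window-based definition of $P_k$. The exact variant used in [Mitra et al., 2013] may differ. Representing a segmentation as a list of segment lengths, rounding the average length, and clamping $k$ into a valid range are assumptions made for this sketch.

```python
def segment_ids(lengths):
    """Expand segment lengths, e.g. [2, 3], into per-position ids [0, 0, 1, 1, 1]."""
    return [seg for seg, n in enumerate(lengths) for _ in range(n)]


def pk(ref_lengths, hyp_lengths, k):
    """Fraction of position pairs (i, i + k) on which reference and hypothesis
    disagree about whether the two positions lie in the same segment."""
    ref, hyp = segment_ids(ref_lengths), segment_ids(hyp_lengths)
    if len(ref) != len(hyp):
        raise ValueError("segmentations must cover the same number of positions")
    n = len(ref)
    if not 0 < k < n:
        raise ValueError("window size k must satisfy 0 < k < n")
    errors = sum(
        (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k]) for i in range(n - k)
    )
    return errors / (n - k)


def averaged_pk(ref_lengths, hyp_lengths):
    """Average P_k over k = max, min and rounded mean gold segment length."""
    n = sum(ref_lengths)
    mean_len = round(n / len(ref_lengths))
    ks = [max(ref_lengths), min(ref_lengths), mean_len]
    ks = [min(max(k, 1), n - 1) for k in ks]  # clamp (assumption) so 0 < k < n
    return sum(pk(ref_lengths, hyp_lengths, k) for k in ks) / len(ks)


# Usage: gold segments of lengths 3, 3, 2 against a hypothesis that merges the last two.
print(averaged_pk([3, 3, 2], [3, 5]))
```

Lower values indicate closer agreement with the gold standard, and an identical segmentation scores 0.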

We can look upon segmentation as a retrieval problem, and
define the Precision and Recall of level-2 segments (PR2 and
RC2), and also for level-1 segments (PR1 and RC1). Let $i$
and $j$ be the starting and ending points of an inferred segment
$s$, i.e. $z_i = z_{i+1} = \ldots = z_j = s$, but $z_{i-1} \neq s$
and $z_{j+1} \neq s$. Then, if there exists $(i_0, j_0, s_0)$ such that
$(i_0, j_0)$ defines a gold-standard segment $s_0$ satisfying
$|i - i_0| < k$ and $|j - j_0| < k$, then inferred segment $(i, j, s)$
is aligned to gold-standard segment $(i_0, j_0, s_0)$. Precision and
recall of a segmentation are defined as

\[ \text{Precision} = \frac{\#\,\text{inferred segments aligned to a gold-standard segment}}{\#\,\text{inferred segments}} \]

\[ \text{Recall} = \frac{\#\,\text{gold-standard segments aligned to an inferred segment}}{\#\,\text{gold-standard segments}} \]
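The alignment-based precision and recall defined above can be sketched in a few lines of Python. The representation of a segment as a `(start, end)` pair and the function names are assumptions for illustration. The segment label is omitted because the alignment condition depends only on the endpoints.

```python
def aligned(segment, gold_segment, k):
    """True if both endpoints differ by strictly less than the threshold k."""
    (i, j), (i0, j0) = segment, gold_segment
    return abs(i - i0) < k and abs(j - j0) < k


def precision_recall(inferred, gold, k):
    """Precision and recall of inferred segments against gold-standard ones."""
    hit_inferred = sum(any(aligned(s, g, k) for g in gold) for s in inferred)
    hit_gold = sum(any(aligned(s, g, k) for s in inferred) for g in gold)
    precision = hit_inferred / len(inferred) if inferred else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    return precision, recall


# Usage with a threshold of 10: only the first inferred segment is aligned.
inferred = [(0, 95), (96, 300), (301, 400)]
gold = [(0, 100), (101, 200), (201, 400)]
print(precision_recall(inferred, gold, k=10))
```

Precision and recall can differ because several inferred segments may align to the same gold-standard segment, or the reverse.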


For level-2, the alignment threshold is set to 500, and at
level-1 it is set to 10. As a baseline, we use sticky HDP-HMM
[Fox et al., 2008] and LaDP [Mitra et al., 2013] at level-1, once
again using the 136 HDP topics. For level-2, LaDP is the
only baseline. We use the BE-BE-CE version, since that is
the most successful according to [Mitra et al., 2013].

From the 60 news transcripts from which we learnt the 136
topics through HDP, we selected 3 (Trans1, Trans2, Trans3)
to report the segmentation. Also, we selected another 3
(Trans91, Trans92, Trans93) from outside the learning set,
for which we used the same initial values of $\phi$. The results
are reported in Table 1. At level-1, GI-BGS matched or
outperformed both competitors on precision and recall for every
transcript, with S1 errors that are comparable (lower than
both competitors on three of the six transcripts). At level-2
also, GI-BGS is competitive on the three measures on all the
transcripts except Trans1.


8 Conclusions 

We carried out a study of various Bayesian models for
hierarchically grouped data with emphasis on how they share
mixture components and distributions among the groups. We
also introduced the notion of Degree-of-Sharing (DoS) as a
nomenclature for such models. We described a Generalized
Bayesian model for this type of data, and showed how various
existing models can be recovered from it. Next we used it to
develop a new model for news transcripts, which have several
peculiar temporal structures, and also provided an inference
algorithm for hierarchical unsupervised segmentation of such
transcripts. We showed that this model can outperform the
existing LaDP model for this task. The DoS concept opens
up possibilities to explore models with DoS-classifications
that have not yet been considered, and the GBM-GSD can
be used to capture complex temporal structures in the data.


            GI-BGS              LaDP                sHDP-HMM
Data        PR1   RC1   S1      PR1   RC1   S1      PR1   RC1   S1
Trans1      0.38  0.46  0.06    0.33  0.46  0.07    0.20  0.40  0.08
Trans2      0.33  0.37  0.10    0.27  0.34  0.11    0.18  0.34  0.12
Trans3      0.26  0.41  0.09    0.25  0.41  0.08    0.13  0.32  0.11
Trans91     0.15  0.28  0.16    0.13  0.28  0.13    0.06  0.21  0.16
Trans92     0.14  0.25  0.14    0.10  0.20  0.14    0.08  0.23  0.14
Trans93     0.22  0.22  0.09    0.17  0.08  0.11    0.12  0.03  0.11

            GI-BGS              LaDP
Data        PR2   RC2   S2      PR2   RC2   S2
Trans1      0.20  0.20  0.08    0.33  0.40  0.11
Trans2      0.80  1.00  0.04    0.71  1.00  0.01
Trans3      1.00  1.00  0.13    1.00  1.00  0.04
Trans91     0.60  0.60  0.06    0.50  0.80  0.07
Trans92     0.60  0.60  0.05    0.20  0.40  0.04
Trans93     0.60  0.75  0.04    0.50  0.75  0.06

Table 1: Above: Comparison of news transcript segmentation at level-1 by sticky
HDP-HMM, LaDP and GI-BGS. Below: News transcript segmentation at level-2 by
LaDP and GI-BGS. Lower values of S1, S2 indicate better segmentation.
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